Abstract



We introduce a natural generalization of submodular set cover and exact active learning with a finite hypothesis 
class (query learning). We call this new problem interactive submodular set cover. Applications include advertising 
in social networks with hidden information. We give an approximation guarantee for a novel greedy algorithm and 
give a hardness of approximation result which matches up to constant factors. We also discuss negative results for 
simpler approaches and present encouraging early experimental results. 

1 Introduction 

As a motivating example, we consider viral marketing in a social network. In the standard version of the problem, the 
goal is to send advertisements to influential members of a social network such that by sending advertisements to only 
a few people our message spreads to a large portion of the network. Previous work [13, 12] has shown that, for many 
models of influence, the influence of a set of nodes can be modelled as a submodular set function. Therefore, selecting 
a small set of nodes with maximal influence can be posed as a submodular function maximization problem. The related 
problem of selecting a minimal set of nodes to achieve a desired influence is a submodular set cover problem. Both of 
these problems can be approximately solved via a simple greedy approximation algorithm. 

Consider a variation of this problem in which the goal is not to send advertisements to people that are influential 
in the entire social network but rather to people that are influential in a specific target group. For example, our target 
group could be people that like snowboarding or people that listen to jazz music. If the members of the target group 
are unknown and we have no way of learning the members of the target group, there is little we can do except assume 
every member of the social network is a member of the target group. However, if we assume the group has some 
known structure and that we receive feedback from sending advertisements (e.g. in the form of ad clicks or survey 
responses), it may be possible to simultaneously discover the members of the group and find people that are influential 
in the group. 

We call problems like this learning and covering problems. In our example, the learning aspect of the problem 
is discovering the members of the target group (the people that like snowboarding), and the covering aspect of the 
problem is to select a small set of people that achieve a desired level of influence in the target group (the people to 
target with advertisements). Other applications have similar structure. For example, we may want to select a small set 
of representative documents about a topic of interest (e.g. about linear algebra). If we do not initially know the topic 
labels for documents, this is also a learning and covering problem. 

We propose a new problem called interactive submodular set cover that can be used to model many learning 
and covering problems. Besides addressing interesting new applications, interactive submodular set cover directly 
generalizes submodular set cover and exact active learning with a finite hypothesis class (query learning) giving new 
insight into many previous theoretical results. We derive and analyze a new algorithm that is guaranteed to perform 
approximately as well as any other algorithm and in fact has the best possible approximation ratio. Our algorithm 
considers the learning and covering parts of the problem simultaneously. It is tempting to treat these two parts
of the problem separately, for example by first solving the learning problem and then solving the covering problem.
We prove that this approach and other simple approaches may perform much worse than the optimal algorithm.
2 Background



2.1 Submodular Set Cover 

A submodular function is a set function satisfying a natural diminishing returns property. We call a set function $F$
defined over a ground set $V$ submodular iff for all $A \subseteq B \subseteq V$ and $v \in V \setminus B$

$$F(A + v) - F(A) \geq F(B + v) - F(B) \tag{1}$$

where $A + v$ denotes $A \cup \{v\}$. In other words, adding an element to $A$, a subset of $B$, results in a gain at least as
large as adding the same element to $B$. $F$ is called modular if Equation 1 holds with equality. $F$ is monotone
non-decreasing if for all $A \subseteq B \subseteq V$, $F(A) \leq F(B)$.
Note that $F$ is monotone non-decreasing and submodular iff Equation 1 holds for all $v \in V$ (including $v \in B$).
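As an illustration of Equation 1, the following sketch checks the diminishing returns property by brute force on a tiny ground set. The toy coverage instance `SETS` and the helper names are assumptions made for this example only; the check is exponential in $|V|$ and is meant purely as a sanity test, not as part of any algorithm in this paper.

```python
from itertools import combinations

def powerset(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def is_submodular(F, ground):
    """Brute-force check of Equation 1 for all A within B within V and v outside B."""
    ground = frozenset(ground)
    for B in powerset(ground):
        for A in powerset(B):
            for v in ground - B:
                if F(A | {v}) - F(A) < F(B | {v}) - F(B):
                    return False
    return True

def is_monotone(F, ground):
    """Check F(A) <= F(A + v) for every A and v, which implies monotonicity."""
    ground = frozenset(ground)
    return all(F(A) <= F(A | {v}) for A in powerset(ground) for v in ground)

# Hypothetical toy instance: each ground element names a set of items.
SETS = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}

def coverage(S):
    """Number of items covered by the chosen sets."""
    covered = set()
    for name in S:
        covered |= SETS[name]
    return len(covered)

def squared_size(S):
    """A set function with increasing returns, so it is not submodular."""
    return len(S) ** 2
```

On this instance `coverage` passes both checks, while `squared_size` fails the submodularity check (adding an element to a larger set yields a larger gain).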

Proposition 1. If $F_1(S), F_2(S), \ldots, F_n(S)$ are all submodular, monotone non-decreasing functions then
$F_1(S) + F_2(S) + \ldots + F_n(S)$ is submodular, monotone non-decreasing.

Proposition 2. For any function $f$ mapping set elements to real numbers the function $F(S) = \max_{s \in S} f(s)$ is a
submodular, monotone non-decreasing function.

In the submodular set cover problem the goal is to find a set $S \subseteq V$ minimizing a modular cost function
$c(S) = \sum_{s \in S} c(s)$ subject to the constraint $F(S) = F(V)$ for a monotone non-decreasing submodular $F$.

Submodular Set Cover
Given:

• Ground set $V$

• Modular cost function $c$ defined over $V$

• Submodular monotone non-decreasing objective function $F$ defined over $V$

Objective: Minimize $c(S)$ such that $F(S) = F(V)$

This problem is closely related to the problem of submodular function maximization under a modular cost constraint
$c(S) \leq k$ for a constant $k$. A number of interesting real world applications can be posed as submodular set
cover or submodular function maximization problems including influence maximization in social networks [12], sensor
placement and experiment design [14], and document summarization [15]. In the sensor placement problem, for
example, the ground set $V$ corresponds to a set of possible locations. An objective function $F(S)$ measures the coverage
achieved by deploying sensors to the locations corresponding to $S \subseteq V$. For many reasonable definitions of
coverage, $F(S)$ turns out to be submodular.

Submodular set cover is a generalization of the set cover problem. In particular, set cover corresponds to the case
where each $v \in V$ is a set of items taken from a set $\bigcup_{v \in V} v$. The goal is to find a small set of sets $S \subseteq V$ such
that $|\bigcup_{s \in S} s| = |\bigcup_{v \in V} v|$. The function $F(S) = |\bigcup_{s \in S} s|$ is monotone non-decreasing and submodular, so this is
a submodular set cover problem. As is the case for set cover, a greedy algorithm has approximation guarantees for
submodular set cover [18]. In particular, if $F$ is integer valued, then the greedy solution is within a factor of
$H(\max_{v \in V} F(\{v\}))$ of the optimal solution, where $H(k)$ is the $k$th harmonic number. Up to lower order terms, this
matches the hardness of approximation lower bound $(1 - o(1)) \ln n$ [7] where $n = |\bigcup_{v \in V} v| = F(V)$.

We note a variation of submodular set cover uses a constraint $F(S) \geq \alpha$ for a fixed threshold $\alpha$. This variation
does not add any difficulty to the problem because we can always define a new monotone non-decreasing submodular
function $\hat{F}(S) = \min(F(S), \alpha)$ [14, 16] to convert the constraint $F(S) \geq \alpha$ into a new constraint $\hat{F}(S) = \hat{F}(V)$. We
can also convert in the other direction from a constraint $F(S) = F(V)$ to $F(S) \geq \alpha$ by setting $\alpha = F(V)$. Without
loss of generality or specificity, we use the variation of the problem with an explicit threshold $F(S) \geq \alpha$.
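The greedy rule for the threshold variant can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `greedy_submodular_cover`, the toy instance `SETS`, and the unit costs are hypothetical, and the gain is truncated at the threshold in the spirit of the $\min(F(S), \alpha)$ construction above.

```python
def greedy_submodular_cover(ground, F, cost, alpha):
    """Add the element with the best truncated gain per unit cost until F(S) >= alpha."""
    S = frozenset()
    while F(S) < alpha:
        best, best_ratio = None, 0.0
        for v in sorted(ground - S):          # sorted only to break ties deterministically
            gain = min(F(S | {v}), alpha) - min(F(S), alpha)
            if gain / cost[v] > best_ratio:
                best, best_ratio = v, gain / cost[v]
        if best is None:                      # no element helps: alpha exceeds F(V)
            raise ValueError("threshold alpha is not reachable")
        S = S | {best}
    return S

# Hypothetical set cover instance: each ground element names a set of items.
SETS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {1, 5}}
COST = {name: 1.0 for name in SETS}

def coverage(S):
    """F(S) = number of items covered by the sets named in S."""
    return len(set().union(*(SETS[name] for name in S)))

solution = greedy_submodular_cover(frozenset(SETS), coverage, COST, alpha=5)
```

On this instance the greedy rule first picks `a` (gain 3) and then `c` (gain 2), reaching the threshold with two sets.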

2.2 Exact Active Learning 

In the exact active learning problem we have a known finite hypothesis class given by a set of objects $H$, and we want
to identify an initially unknown target hypothesis $h^* \in H$. We identify $h^*$ by asking questions. Define $Q$ to be the
known set of all possible questions. A question $q$ maps an object $h$ to a set of valid responses $q(h) \subseteq R$ with $q(h) \neq \emptyset$,
where $R = \bigcup_{q \in Q, h \in H} q(h)$ is the set of all possible responses. We know the mapping for each $q$ (i.e. we know $q(h)$
for every $q$ and $h$). Asking $q$ reveals some element $r \in q(h^*)$ which may be chosen adversarially (chosen to impede
the learning algorithm). Each question $q \in Q$ has a positive cost $c(q)$ defined by the modular cost function $c$.
The goal of active learning is to ask a sequence of questions with small total cost that identifies $h^*$. By identifying
$h^*$, we mean that for every $h \neq h^*$ we have received some response $r$ to a question $q$ such that $r \notin q(h)$. Questions
are chosen sequentially so that the response from a previous question can be used to decide which question to ask next.
The problem is stated below.

Exact Active Learning
Given:

• Hypothesis class $H$ containing an unknown target $h^*$

• Query set $Q$ and response set $R$ with $q(h) \subseteq R$ for $q \in Q$, $h \in H$

• Modular query cost function $c$ defined over $Q$

Repeat: Ask a question $\hat{q}_i \in Q$ and receive a response $\hat{r}_i \in \hat{q}_i(h^*)$

Until: $h^*$ is identified (for every $h \in H$ with $h \neq h^*$ there is a $(\hat{q}_i, \hat{r}_i)$ with $\hat{r}_i \notin \hat{q}_i(h)$)

Objective: Minimize $\sum_i c(\hat{q}_i)$

In a typical exact learning problem, $H$ is a set of different classifiers and $h^*$ is a unique zero-error classifier.
Questions in $Q$ can, for example, correspond to label (membership) queries for data points. If we have a fixed data set
consisting of data points $x_i$, we can create a question $q_i$ corresponding to each $x_i$ and set $q_i(h) = \{h(x_i)\}$. Questions
can also correspond to more complicated queries. For example, a question can ask if any points in a set are positively 
labelled. The setting we have described allows for mixing arbitrary types of queries with different costs. 

For a set of question-response pairs $S$, define the version space $V(S)$ to be the subset of $H$ consistent with $S$

$$V(S) \triangleq \{h \in H : \forall (q, r) \in S,\ r \in q(h)\}$$

In terms of the version space, the goal of exact active learning is to ask a sequence of questions such that $|V(\hat{S})| = 1$.
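The version space definition translates directly into code. The sketch below assumes a hypothetical toy class of threshold classifiers with label queries; the names `ANSWERS`, `HYPOTHESES`, and `POINTS` are illustrative and not part of the problem definition.

```python
def version_space(hypotheses, answers, S):
    """Return every h such that r is a valid response in q(h) for all (q, r) in S."""
    return {h for h in hypotheses if all(r in answers[q][h] for q, r in S)}

# Hypothetical toy class: threshold classifiers h_t on points 0..3,
# where h_t labels point i as 1 iff i >= t.
POINTS = range(4)
HYPOTHESES = range(5)                      # thresholds t = 0, ..., 4
# Label query q_i has the single valid response q_i(h_t) = {h_t(i)}.
ANSWERS = {i: {t: {int(i >= t)} for t in HYPOTHESES} for i in POINTS}

after_one = version_space(HYPOTHESES, ANSWERS, {(1, 1)})
after_two = version_space(HYPOTHESES, ANSWERS, {(1, 1), (0, 0)})
```

Here learning that point 1 is labelled 1 leaves thresholds 0 and 1, and additionally learning that point 0 is labelled 0 leaves only threshold 1, so the target is identified.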

We note that the assumption that $H$ and $Q$ are finite is not a problem for many applications involving finite data
sets. In particular, if we have an infinite hypothesis class (e.g. linear classifiers with dimension $d$) and a finite data
set, we can simply use the effective hypothesis class induced by the data set [4]. On the other hand, the assumption that
we have direct access to the target hypothesis (every $\hat{r}_i$ is in $\hat{q}_i(h^*)$) and that the target hypothesis is in our hypothesis
class ($h^* \in H$) is a limiting assumption. Stated differently, we assume that there is no noise and that the hypothesis
class is correct.

Building on previous work [3], Hanneke [10] showed that a simple greedy active learning strategy is approximately 
optimal in the setting we have described. The greedy strategy selects the question which relative to cost distinguishes 
the greatest number of hypotheses from $h^*$. Hanneke [10] shows this strategy incurs no more than $\ln |H|$ times the
cost of any other question asking strategy. 

The algorithms and approximation factors for submodular set cover and exact active learning are quite similar. 
Both are simple greedy algorithms and the $\ln F(V)$ approximation for submodular set cover is similar to the $\ln |H|$
approximation for active learning. These similarities suggest these problems may be special cases of some other more 
general problem. We show that in fact they are special cases of a problem which we call interactive submodular set 
cover. 

3 Problem Statement 

We use notation similar to the exact active learning problem we described in the previous section. Assume we have a
finite hypothesis class $H$ containing an unknown target hypothesis $h^* \in H$. We again assume there is a finite set of
questions $Q$, a question $q$ maps each object $h$ to a set of valid responses $q(h) \subseteq R$ with $q(h) \neq \emptyset$, and each question
$q \in Q$ has a positive cost $c(q)$ defined by the modular cost function $c$. We also again assume that we know the mapping
for each $q$ (i.e. we know $q(h)$ for every $q$ and $h$). Asking $q$ reveals some adversarially chosen element $r \in q(h^*)$. In the
exact active learning problem the goal is to identify $h^*$ through questions. In this work we consider a generalization
of this problem in which the goal is instead to satisfy a submodular constraint that depends on $h^*$.

We assume that for each object $h$ there is a corresponding monotone non-decreasing submodular function $F_h$
defined over subsets of $Q \times R$ (sets of question-response pairs). We repeatedly ask a question $\hat{q}_i$ and receive a
response $\hat{r}_i$. Let the sequence of questions be $\hat{Q} = (\hat{q}_1, \hat{q}_2, \ldots)$ and sequence of responses be $\hat{R} = (\hat{r}_1, \hat{r}_2, \ldots)$.
Define $\hat{S} = \bigcup_i \{(\hat{q}_i, \hat{r}_i)\}$ to be the final set of question-response pairs corresponding to these sequences. Our
goal is to ask a sequence of questions with minimal total cost $c(\hat{Q})$ which ensures $F_{h^*}(\hat{S}) \geq \alpha$ for some threshold $\alpha$
without knowing $h^*$ beforehand. We call this problem interactive submodular set cover.
Interactive Submodular Set Cover
Given:

• Hypothesis class $H$ containing an unknown target $h^*$

• Query set $Q$ and response set $R$ with known $q(h) \subseteq R$ for every $q \in Q$, $h \in H$

• Modular query cost function $c$ defined over $Q$

• Submodular monotone non-decreasing objective functions $F_h$ for $h \in H$ defined over $Q \times R$

• Objective threshold $\alpha$

Repeat: Ask a question $\hat{q}_i \in Q$ and receive a response $\hat{r}_i \in \hat{q}_i(h^*)$
Until: $F_{h^*}(\hat{S}) \geq \alpha$ where $\hat{S} = \bigcup_i \{(\hat{q}_i, \hat{r}_i)\}$
Objective: Minimize $\sum_i c(\hat{q}_i)$

Note that although we know the hypothesis class H and the corresponding objective functions Fh, we do not 
initially know h*. Information about h* is only revealed as we ask questions and receive responses to questions. 
Responses to previous questions can be used to decide which question to ask next, so in this way the problem is
"interactive." Furthermore, the objective function for each hypothesis $F_h$ is defined over sets of question-response
pairs (as opposed to, say, sets of questions), so when asking a new question we cannot predict how the value of $F_h$ will
change until after we receive a response. The only restriction on the response we receive is that it must be consistent
with the initially unknown target $h^*$. It is this uncertainty about $h^*$ and the feedback we receive from questions that
distinguishes the problem from submodular set cover and allows us to model learning and covering problems. 

3.1 Connection to Submodular Set Cover 

If we know $h^*$ (e.g. if $|H| = 1$) and we assume $|q(h)| = 1$ $\forall q \in Q, h \in H$ (i.e. that there is only one valid response to
every question), our problem reduces exactly to the standard submodular set cover problem. Under these assumptions,
we can compute $F_{h^*}(S)$ for any set of questions without actually asking these questions. Krause et al. [14] study
a non-interactive version of interactive submodular set cover in which $|q(h)| = 1$ $\forall q \in Q, h \in H$ and the entire
sequence of questions must be chosen before receiving any responses. This restricted version of the problem can also
be reduced to standard submodular set cover (Krause et al. [14]).

3.2 Connection to Active Learning 

Define 

$$F_h(S) \triangleq F(S) = |H \setminus V(S)|$$

where V(S) is again the version space (the set of hypotheses consistent with S). This objective is the number of 
hypotheses eliminated from the version space by S. 

Lemma 1. $F_h(S) \triangleq |H \setminus V(S)|$ is submodular and monotone non-decreasing.

Proof. To see this note that we can write $F_h$ as $F_h(S) = \sum_{h' \in H} \max_{(q,r) \in S} f_{h'}((q, r))$ where $f_{h'}((q, r)) = 1$ if
$r \notin q(h')$ and else $f_{h'}((q, r)) = 0$. The result then follows from Proposition 1 and Proposition 2. □

For this objective, if we set $\alpha = |H| - 1$ we get the standard exact active learning problem: our goal is to identify
$h^*$ using a set of questions with small total cost. Note that in this case the objective $F_h$ does not actually depend on $h$
(i.e. $F_h = F_{h'}$ for all $h, h' \in H$) but the problem still differs from standard submodular set cover because $F_h(S)$ is
defined over question-response pairs.

Interactive submodular set cover can also model an approximate variation of active learning with a finite hypothesis
class and finite data set. Define

$$F_h(S) \triangleq |H \setminus V(S)|(|X| - \kappa) + \sum_{h' \in V(S)} \min\Big(|X| - \kappa, \sum_{x \in X} I(h'(x) = h(x))\Big)$$

where $I$ is the indicator function, $X$ is a finite data set, and $\kappa$ is an integer.

Proposition 3. $F_{h^*}(S) = |H|(|X| - \kappa)$ iff all hypotheses in the version space make at most $\kappa$ mistakes.
Lemma 2. $F_h(S) \triangleq |H \setminus V(S)|(|X| - \kappa) + \sum_{h' \in V(S)} \min(|X| - \kappa, \sum_{x \in X} I(h'(x) = h(x)))$ is submodular and
monotone non-decreasing.

Proof. We can write $F_h$ as $F_h(S) = \sum_{h' \in H} \max_{(q,r) \in S} f_{h'}((q, r))$ where $f_{h'}((q, r)) = |X| - \kappa$ if $r \notin q(h')$ and else
$f_{h'}((q, r)) = \min(|X| - \kappa, \sum_{x \in X} I(h'(x) = h(x)))$. The result then follows from Proposition 1 and Proposition
2. □

For this objective, if we set $\alpha = |H|(|X| - \kappa)$ then our goal is to ask a sequence of questions such that all
hypotheses in the version space make at most $\kappa$ mistakes. Balcazar et al. [3] study a similar approximate query
learning setting, and Dasgupta et al. [5] consider a slightly different setting where the target hypothesis may not be in
$H$.
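To make the approximate objective concrete, the sketch below evaluates it on a hypothetical class of threshold classifiers stored as label tuples, with label queries written as (point index, label) pairs. The instance and all names are assumptions for illustration; only the formula for the objective is taken from the text.

```python
def version_space(H, S):
    """Hypotheses (label tuples) consistent with label queries S = {(i, label)}."""
    return [g for g in H if all(g[i] == r for i, r in S)]

def approx_objective(H, h, S, kappa):
    """F_h(S) for the approximate learning objective with mistake budget kappa."""
    cap = len(h) - kappa                       # |X| - kappa
    V = version_space(H, S)
    agreements = [sum(a == b for a, b in zip(g, h)) for g in V]
    return (len(H) - len(V)) * cap + sum(min(cap, a) for a in agreements)

# Hypothetical class: threshold classifiers over |X| = 4 points, stored as label tuples.
H = [tuple(int(i >= t) for i in range(4)) for t in range(5)]
TARGET = H[2]                                  # labels (0, 0, 1, 1)
KAPPA = 1
ALPHA = len(H) * (4 - KAPPA)                   # |H|(|X| - kappa) = 15

one_query = {(1, 0)}                           # point 1 was labelled 0
two_queries = {(1, 0), (3, 1)}                 # then point 3 was labelled 1
```

After one query the version space still contains the all-zero classifier, which makes two mistakes, and the objective is 14. After the second query the objective reaches the threshold 15 even though two hypotheses remain, since both make at most one mistake.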



3.3 Connection to Adaptive Submodularity 

In concurrent work, Golovin and Krause [9] show results similar to ours for a different but related class of problems 
which also involve interactive (i.e. sequential, adaptive) optimization of submodular functions. What Golovin and 
Krause call realizations correspond to hypotheses in our work, while items and states correspond to queries and
responses respectively. Golovin and Krause consider both average-case and worst-case settings and both maximization
and min-cost coverage problems. In contrast, we only consider worst-case, min-cost coverage problems. In this sense 
our results are less general. 

However, in other ways our results are more general. The main greedy approximation guarantees shown by 
Golovin and Krause require that the problem is adaptive submodular; adaptive submodularity depends not only on the 
objective but also on the set of possible realizations and the probability distribution over these realizations. In contrast 
we only require that for a fixed hypothesis the objective is submodular. Golovin and Krause call this pointwise 
submodularity. Pointwise submodularity does not in general imply adaptive submodularity (see the clustered failure 
model discussed by Golovin and Krause). 

In fact, for problems that are pointwise modular but not adaptive submodular, Golovin and Krause show a hardness 
of approximation lower bound of $O(|Q|^{1-\epsilon})$; we note this does not contradict our results as their proof is for
average-case cost and uses a hypothesis class with $|H| = 2^{|Q|}$. Golovin and Krause also propose a simple non-greedy
approach with explicit explore and exploit stages; this approach requires only a weaker assumption that the value of 
the exploitation stage is adaptive submodular with respect to exploration. However, it is not immediately obvious when 
this condition holds, and it is also not clear how to apply this approach to worst-case or min-cost coverage problems. 

There are other smaller differences between our problem settings: we let queries map hypotheses to sets of valid
responses (in general $|q(h)| > 1$) while Golovin and Krause define realizations as maps from items to single states.
Also, in our work we allow for non-uniform query costs (in general $c(q_i) \neq c(q_j)$) while Golovin and Krause require
that every item has the same cost (Golovin and Krause do however mention that the extension to non-uniform costs is
straightforward). We finally note that the proof techniques we use are quite different.

Some other previous work has also considered interactive versions of covering problems in an average-case model 
[1, 8]. The work of Asadpour et al. [1] is perhaps most similar and considers a submodular function maximization 
problem over independent random variables which are sequentially queried. The setting considered by Golovin and 
Krause [9] strictly generalizes this setting. Streeter and Golovin [17] study an online version of submodular function 
maximization where a sequence of submodular function maximization problems is solved. This problem is related in 
that it also involves learning and submodular functions, but the setting is very different from the one studied here where
we solve a single interactive problem as opposed to a series of non-interactive problems. 



4 Example 

In the advertising application we described in the introduction, the target hypothesis h* corresponds to the group of 
people we want to target with advertisements (e.g. the people that like snowboarding), and the hypothesis class H 
encodes our prior knowledge about h*. For example, if we know the target group forms a small dense subgraph in the 
social network, then the hypothesis class H would be the set of all small dense subgraphs in the social network. The 
query set Q and response set R correspond to advertising actions and feedback respectively, and finally the objective 
function Fh measures advertising coverage within the group corresponding to h. 
Figure 1: A cartoon example social network.



To make the discussion concrete, assume the advertiser sends a single ad at a time and that after a person is sent
an ad the advertiser receives a binary response indicating if that person is in the target group (i.e. likes snowboarding).
Let $q_i$ correspond to sending an ad to user $i$ (i.e. node $i$), and $q_i(h) = \{1\}$ if user $i$ is in group $h$ and $q_i(h) = \{0\}$
otherwise. For our coverage goal, assume the advertiser wants to ensure that every person in the target group either
receives an ad or has a friend that receives an ad. We say a node is "covered" if it has received an ad or has a neighbor
that has received an ad. This is a variation of the minimum dominating set problem, and we use the following objective

$$F_h(S) \triangleq \sum_{v \in V_h} I(v \in V_S \text{ or } \exists s \in V_S : (v, s) \in E) + |V \setminus V_h|$$

where $V$ and $E$ are the nodes and edges in the social network, $V_h$ is the set of nodes in group $h$, and $V_S$ is the set of
nodes corresponding to ads we have sent. With this objective $F_{h^*}(S) = |V|$ iff we have achieved the stated coverage
goal.

Lemma 3. $F_h(S) \triangleq \sum_{v \in V_h} I(v \in V_S \text{ or } \exists s \in V_S : (v, s) \in E) + |V \setminus V_h|$ is submodular and monotone
non-decreasing.

Proof. We can write $F_h(S)$ as $F_h(S) = \sum_{v \in V} \max_{(q,r) \in S} f_v((q, r))$ where $f_v((q, r)) = 1$ if the action $q$ covers $v$ or
$v \notin V_h$ and $f_v((q, r)) = 0$ otherwise. The result then follows from Proposition 1 and Proposition 2. □
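The coverage objective from the advertising example can be sketched as below. The path graph, the group, and the function name `coverage_objective` are hypothetical choices for this illustration; the social network of Figure 1 is not reproduced here.

```python
def coverage_objective(nodes, edges, group, S):
    """F_h(S): covered members of `group` plus the number of nodes outside it."""
    sent = {q for q, _ in S}                  # V_S: nodes that were sent an ad
    neighbors = {v: set() for v in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    covered = sum(1 for v in group if v in sent or neighbors[v] & sent)
    return covered + len(set(nodes) - set(group))

# Hypothetical instance: a path 0 - 1 - 2 - 3 - 4 with target group {0, 1, 2}.
NODES = [0, 1, 2, 3, 4]
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
GROUP = {0, 1, 2}
```

Sending a single ad to node 1 (with response 1) covers the whole group, so the objective reaches $|V| = 5$; an ad to node 4 covers no group member and leaves the objective at its initial value of 2.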

Figure 1 shows a cartoon social network. For this example, assume the advertiser knows the target group is one 
of the four clusters shown (marked A, B, C, and D) but does not know which. This is our hypothesis class H. The 
node marked v is initially very useful for learning the members of the target group: if we send an ad to this node, no 
matter what response we receive we are guaranteed to eliminate two of the four clusters (either A and B or C and 
D). However, this node has only a degree of 2 and therefore sending an ad to this node does not cover very many 
nodes. On the other hand, the nodes marked x and w are connected to every node in clusters B and D respectively; x
(resp. w) is therefore very useful for achieving the coverage objective if the target group is B (resp. D). An algorithm
for learning and covering must choose between actions more beneficial for learning vs. actions more beneficial for 
covering (although sometimes an action can be beneficial for both to a certain degree). The interplay between learning 
and covering is similar to the exploration-exploitation trade-off in reinforcement learning. In this example an optimal 
strategy is to first send an ad to v and then cover the remaining two clusters using two additional ads for a worst case 
cost of 3. 

A simple approach to learning and covering is to simply ignore feedback and solve the covering problem for all 
possible target groups. In our example application the resulting covering problem is a simple dominating set problem 
for which we can use standard submodular set cover methods. We call this the Cover All strategy. This approach is 
suboptimal because in many cases feedback can make the problem significantly easier. In our synthetic example, any 
strategy not using feedback must incur a worst case cost of 4: four ads are required to cover all of the nodes in the four
clusters. Theorem 4 in Section 6 proves that in fact there are cases where the best strategy not using feedback incurs 
exponentially greater cost than the best strategy using feedback. 

Another simple approach is to solve the learning problem first (identify h*) and then solve the covering problem 
(satisfy $F_{h^*}(S) \geq \alpha$). We can use, for example, query learning to solve the learning problem and then use standard
submodular set cover to solve the covering problem. We call this the Learn then Cover strategy. This approach turns
out to match the optimal strategy in the example given by Figure 1. In this example the target group can be identified
using 2 queries by querying v then w if the response is 1 and x if the response is 0. After identifying the target group, 
the target group can be covered in at most one more query. However, this approach is not optimal for other instances 
of this problem. For example, if we were to add an additional node which is connected to every other node then the 
covering problem would have a solution of cost 1 while the learning problem would still require cost of 2. Theorem 3 
in Section 6 shows that in fact there are examples where solving a learning problem is much harder than solving the 
corresponding learning and covering problem. We therefore must consider other methods for balancing learning and 
covering. 

We note that this problem setup can be modified to allow queries to have sometimes uninformative responses; this 
can be modeled by adding an additional response to R which corresponds to a "no-feedback" response and including 
this response in the set of allowable responses ($q(h)$) for certain query-hypothesis pairs. However, care must be taken
to ensure that the resulting problem is still interesting for worst-case choice of responses; if we allow "no-feedback"
responses for every question-hypothesis pair, then in the worst case we will never receive any feedback, so a worst
case optimal strategy could ignore all responses.

5 Greedy Approximation Guarantee 

We are interested in approximately optimal polynomial time algorithms for the interactive submodular set cover problem.
We call a question asking strategy correct if it always asks a sequence of questions such that $F_{h^*}(\hat{S}) \geq \alpha$ where $\hat{S}$
is again the final set of question-response pairs. A necessary and sufficient condition to ensure $F_{h^*}(\hat{S}) \geq \alpha$ for worst
case choice of $h^*$ is to ensure $\min_{h \in V(\hat{S})} F_h(\hat{S}) \geq \alpha$ where $V(\hat{S})$ is the version space. Then a simple stopping condition
which ensures a question asking strategy is correct is to continue asking questions until $\min_{h \in V(\hat{S})} F_h(\hat{S}) \geq \alpha$.
We call a question asking strategy approximately optimal if it is correct and the worst case cost incurred by the strategy
is not much worse than the worst case cost of any other strategy.

As discussed informally in the previous section, it is important for a question asking strategy to balance between 
learning (identifying h*) and covering (increasing Fh*). Ignoring either aspect of the problem is in general suboptimal 
(we show this formally in Section 6). We propose a reduction which converts the problem over many objective
functions $F_h$ into a problem over a single objective function $\bar{F}_\alpha$ that encodes the trade-off between learning and
covering. We can then use a greedy algorithm to maximize this single objective, and this turns out to overcome the
shortcomings of simpler approaches. This reduction is inspired by the reduction used by Krause et al. [14] in the 
non-interactive setting to convert multiple covering constraints into a single covering constraint. 

Define

$$\bar{F}_\alpha(S) \triangleq (1/|H|)\Big(\sum_{h \in V(S)} \min(\alpha, F_h(S)) + \alpha |H \setminus V(S)|\Big)$$

$\bar{F}_\alpha(S) \geq \alpha$ iff $F_h(S) \geq \alpha$ for all $h \in V(S)$, so a question asking strategy is correct iff it satisfies $\bar{F}_\alpha(\hat{S}) \geq \alpha$. This
objective balances the value of learning and covering. The sum over $h \in V(S)$ measures progress towards satisfying
the covering constraint for hypotheses $h$ in the current version space (covering). The second term $\alpha |H \setminus V(S)|$
measures progress towards identifying $h^*$ through reduction in version space size (learning). Note that the objective
does not make a hard distinction between learning actions and covering actions. In fact, the objective will prefer
actions that both increase $F_h(S)$ for $h \in V(S)$ and decrease the size of $V(S)$. Crucially, $\bar{F}_\alpha$ retains submodularity.
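The equivalence between the combined objective reaching the threshold and every hypothesis in the version space being covered can be verified by expanding the definition. The short derivation below is a sketch that uses only the definition of $\bar{F}_\alpha$, the identity $|H| = |V(S)| + |H \setminus V(S)|$, and the fact that each truncated term is at most $\alpha$.

```latex
\begin{align*}
\bar{F}_\alpha(S) \geq \alpha
&\iff \sum_{h \in V(S)} \min(\alpha, F_h(S)) + \alpha\,|H \setminus V(S)| \geq \alpha\,|H| \\
&\iff \sum_{h \in V(S)} \min(\alpha, F_h(S)) \geq \alpha\,|V(S)| \\
&\iff \min(\alpha, F_h(S)) = \alpha \text{ for all } h \in V(S) \\
&\iff F_h(S) \geq \alpha \text{ for all } h \in V(S).
\end{align*}
```

The third step holds because every term of the sum is bounded above by $\alpha$, so the sum can reach $\alpha |V(S)|$ only if each term attains the bound.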

Lemma 4. $\bar{F}_\alpha$ is submodular and monotone non-decreasing when every $F_h$ is submodular and monotone
non-decreasing.

Proof. Note that the proof would be trivial if the sum were over all $h \in H$. However, since the sum is over a subset
of $H$ which depends on $S$, the result is not obvious. We can write $\bar{F}_\alpha$ as $\bar{F}_\alpha(S) = (1/|H|) \sum_{h \in H} \bar{F}_{\alpha,h}(S)$ where
we define $\bar{F}_{\alpha,h}(S) \triangleq I(h \in V(S)) \min(\alpha, F_h(S)) + I(h \notin V(S)) \alpha$. It is not hard to see $\bar{F}_{\alpha,h}$ is monotone
non-decreasing. We show $\bar{F}_{\alpha,h}$ is also submodular and the result then follows from Proposition 1. Consider any $(q, r) \notin B$
and $A \subseteq B \subseteq (Q \times R)$. We show Equation 1 holds in three cases. Here we use as short hand
$\mathrm{Gain}(F, S, s) \triangleq F(S + s) - F(S)$.
• If $h \notin V(B)$ then

$$\mathrm{Gain}(\bar{F}_{\alpha,h}, A, (q, r)) \geq 0 = \mathrm{Gain}(\bar{F}_{\alpha,h}, B, (q, r))$$
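Algorithm 1 Worst Case Greedy
1: $\hat{H} \Leftarrow H$
2: $\hat{S} \Leftarrow \emptyset$
3: while $\bar{F}_\alpha(\hat{S}) < \alpha$ do
4:   $\hat{q} \Leftarrow \operatorname{argmax}_{q_i \in Q} \min_{h \in V(\hat{S})} \min_{r_i \in q_i(h)} (\bar{F}_\alpha(\hat{S} + (q_i, r_i)) - \bar{F}_\alpha(\hat{S}))/c(q_i)$
5:   Ask $\hat{q}$ and receive response $\hat{r}$
6:   $\hat{S} \Leftarrow \hat{S} + (\hat{q}, \hat{r})$
7: end while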

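• If $r \notin q(h)$ then

$$\mathrm{Gain}(\bar{F}_{\alpha,h}, A, (q, r)) = \alpha - \bar{F}_{\alpha,h}(A) \geq \alpha - \bar{F}_{\alpha,h}(B) = \mathrm{Gain}(\bar{F}_{\alpha,h}, B, (q, r))$$

• If $r \in q(h)$ and $h \in V(B)$ then

$$\mathrm{Gain}(\bar{F}_{\alpha,h}, A, (q, r)) = \min(F_h(A + (q, r)), \alpha) - \min(F_h(A), \alpha)$$

$$\geq \min(F_h(B + (q, r)), \alpha) - \min(F_h(B), \alpha) = \mathrm{Gain}(\bar{F}_{\alpha,h}, B, (q, r))$$

Here we used the submodularity of $\min(F_h(S), \alpha)$ [16].

□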

Algorithm 1 shows the worst case greedy algorithm which at each step picks the question $q_i$ that maximizes the
worst case gain of $\bar{F}_\alpha$

$$\min_{h \in V(\hat{S})} \min_{r_i \in q_i(h)} (\bar{F}_\alpha(\hat{S} + (q_i, r_i)) - \bar{F}_\alpha(\hat{S}))/c(q_i)$$
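The greedy rule can be sketched in a few lines of Python. Everything about the instance below is a hypothetical assumption chosen so the run can be checked by hand: four nodes in two disjoint edges, two candidate groups, unit costs, binary membership responses, and the dominating-set style objective from the advertising example. Exact rational arithmetic is used so the stopping test is not affected by rounding.

```python
from fractions import Fraction

NODES = [0, 1, 2, 3]
NEIGHBORS = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
GROUPS = {"A": {0, 1}, "B": {2, 3}}          # hypothesis class H
ALPHA = len(NODES)
COST = {q: 1 for q in NODES}                 # question q = "send an ad to node q"

def valid_responses(q, h):
    """q(h): the single valid response is 1 iff node q belongs to group h."""
    return {int(q in GROUPS[h])}

def version_space(S):
    return [h for h in GROUPS if all(r in valid_responses(q, h) for q, r in S)]

def F(h, S):
    """Dominating-set style objective F_h(S) from the advertising example."""
    sent = {q for q, _ in S}
    covered = sum(1 for v in GROUPS[h] if v in sent or NEIGHBORS[v] & sent)
    return covered + len(NODES) - len(GROUPS[h])

def f_bar(S):
    """Combined learning-and-covering objective, kept exact with Fraction."""
    V = version_space(S)
    total = sum(min(ALPHA, F(h, S)) for h in V) + ALPHA * (len(GROUPS) - len(V))
    return Fraction(total, len(GROUPS))

def worst_case_greedy(respond):
    """Run the greedy rule; `respond(q)` plays the role of asking question q."""
    S, asked = frozenset(), []
    while f_bar(S) < ALPHA:
        def worst_gain(q):
            return min(f_bar(S | {(q, r)}) - f_bar(S)
                       for h in version_space(S)
                       for r in valid_responses(q, h)) / COST[q]
        q = max(NODES, key=worst_gain)       # ties go to the lowest-numbered node
        S = S | {(q, respond(q))}
        asked.append(q)
    return asked

asked_if_A = worst_case_greedy(lambda q: int(q in GROUPS["A"]))
asked_if_B = worst_case_greedy(lambda q: int(q in GROUPS["B"]))
```

On this instance all first questions tie with a worst case gain of 1, so node 0 is asked first. If the target group is A the single ad already covers it; if the target is B the response eliminates A and a second ad to node 2 covers B.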

We now argue that Algorithm 1 is an approximately optimal algorithm for interactive submodular set cover. Note 
that although it is a simple greedy algorithm over a single submodular objective, the standard submodular set cover 
analysis doesn't apply: the objective function is defined over question-response pairs, and the algorithm cannot predict 
the actual objective function gain until after selecting and committing to a question and receiving a response. We use
an Extended Teaching Dimension style analysis [10] inspired by previous work in query learning. We are the first to 
our knowledge to use this kind of proof for a submodular optimization problem. 

Define an oracle (teacher) $T \in R^Q$ to be a function mapping questions to responses. As a short hand, for a
sequence of questions $\hat{Q}$ define

$$T(\hat{Q}) \triangleq \bigcup_{\hat{q}_i \in \hat{Q}} \{(\hat{q}_i, T(\hat{q}_i))\}$$

$T(\hat{Q})$ is the set of question-response pairs received when $T$ is used to answer the questions in $\hat{Q}$. We now define a
quantity analogous to the General Identification Cost for exact active learning [10]. Define the General Cover Cost,
$GCC$,

$$GCC \triangleq \max_{T \in R^Q} \Big( \min_{\hat{Q} : \bar{F}_\alpha(T(\hat{Q})) \geq \alpha} c(\hat{Q}) \Big)$$

$GCC$ depends on $H$, $Q$, $\alpha$, $c$, and the objective functions $F_h$, but for simplicity of notation this dependence is
suppressed. $GCC$ can be viewed as the cost of satisfying $\bar{F}_\alpha(T(\hat{Q})) \geq \alpha$ for worst case choice of $T$ where the choice
of $T$ is known to the algorithm selecting $\hat{Q}$. Here the worst case choice of $T$ is over all mappings between $Q$ and $R$.
There is no restriction that $T$ answer questions in a manner consistent with any hypothesis $h \in H$.
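For very small instances the General Cover Cost can be computed by exhaustive enumeration, which may help build intuition for the definition. The sketch below assumes a hypothetical instance (four nodes in two disjoint edges, two candidate groups, unit costs, binary responses, and the dominating-set style objective from the advertising example); it enumerates every oracle, including oracles that are inconsistent with all hypotheses, and is exponential in $|Q|$.

```python
from fractions import Fraction
from itertools import combinations, product

NODES = [0, 1, 2, 3]                          # questions: send an ad to a node
NEIGHBORS = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
GROUPS = {"A": {0, 1}, "B": {2, 3}}           # hypothesis class H
ALPHA = len(NODES)

def consistent(h, S):
    """True if every response in S is valid for hypothesis h."""
    return all(r == int(q in GROUPS[h]) for q, r in S)

def F(h, S):
    """Objective F_h(S): covered members of group h plus nodes outside it."""
    sent = {q for q, _ in S}
    covered = sum(1 for v in GROUPS[h] if v in sent or NEIGHBORS[v] & sent)
    return covered + len(NODES) - len(GROUPS[h])

def f_bar(S):
    """Combined objective over the version space induced by S."""
    V = [h for h in GROUPS if consistent(h, S)]
    total = sum(min(ALPHA, F(h, S)) for h in V) + ALPHA * (len(GROUPS) - len(V))
    return Fraction(total, len(GROUPS))

def general_cover_cost():
    """Max over oracles T of the cheapest question set whose answers reach ALPHA."""
    worst = 0
    for responses in product([0, 1], repeat=len(NODES)):   # every oracle T
        T = dict(zip(NODES, responses))
        cheapest = min(len(qs)                             # unit costs
                       for k in range(len(NODES) + 1)
                       for qs in combinations(NODES, k)
                       if f_bar({(q, T[q]) for q in qs}) >= ALPHA)
        worst = max(worst, cheapest)
    return worst
```

On this instance the enumeration returns 2: the oracle that answers 0 to every question forces two questions, after which the version space is empty and the combined objective equals the threshold.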
We first show that $GCC$ is a lower bound on the optimal worst case cost of satisfying $F_{h^*}(\hat{S}) \geq \alpha$.

Lemma 5. If there is a correct question asking strategy for satisfying $F_{h^*}(S) \geq \alpha$ with worst case cost $C^*$ then $\mathrm{GCC} \leq C^*$.

Proof. Assume the lemma is false and there is a correct question asking strategy with worst case cost $C^*$ and $\mathrm{GCC} > C^*$. Using this assumption and the definition of GCC, there is some oracle $T^*$ such that

$$\min_{\hat{Q} : F_\alpha(T^*(\hat{Q})) \geq \alpha} c(\hat{Q}) = \mathrm{GCC} > C^*$$

When we use $T^*$ to answer questions, any sequence of questions $\hat{Q}$ with total cost less than or equal to $C^*$ must have $F_\alpha(S) < \alpha$, where $S = T^*(\hat{Q})$. $F_\alpha(S) < \alpha$ in turn implies $F_{h^*}(S) < \alpha$ for some target hypothesis choice $h^* \in V(S)$. This contradicts the assumption that there is a correct strategy with worst case cost $C^*$. □

We now establish that when GCC is small, there must be a question which increases $F_\alpha$.

Lemma 6. For any initial set of question-response pairs $S$, there must be a question $q \in Q$ such that

$$\min_{h \in V(S)} \min_{r \in q(h)} F_\alpha(S + (q, r)) - F_\alpha(S) \geq c(q)(\alpha - F_\alpha(S)) / \mathrm{GCC}$$

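Proof. Assume the lemma is false and for every question $q$ there is some $h \in V(S)$ and $r \in q(h)$ such that

$$F_\alpha(S + (q, r)) - F_\alpha(S) < c(q)(\alpha - F_\alpha(S)) / \mathrm{GCC}$$

Define an oracle $T'$ which answers every question with a response satisfying this inequality. For example, one such $T'$ is

$$T'(q) \triangleq \arg\min_r F_\alpha(S + (q, r)) - F_\alpha(S)$$

By the definition of GCC

$$\min_{\hat{Q} : F_\alpha(T'(\hat{Q})) \geq \alpha} c(\hat{Q}) \leq \max_{T \in R^Q} \left( \min_{\hat{Q} : F_\alpha(T(\hat{Q})) \geq \alpha} c(\hat{Q}) \right) = \mathrm{GCC}$$

so there must be a sequence of questions $\hat{Q}$ with $c(\hat{Q}) \leq \mathrm{GCC}$ such that $F_\alpha(T'(\hat{Q})) \geq \alpha$. Because $F_\alpha$ is monotone non-decreasing, we also know $F_\alpha(T'(\hat{Q}) \cup S) \geq \alpha$. Using the submodularity of $F_\alpha$,

$$F_\alpha(T'(\hat{Q}) \cup S) \leq F_\alpha(S) + \sum_{q \in \hat{Q}} \left(F_\alpha(S \cup \{(q, T'(q))\}) - F_\alpha(S)\right)$$

$$< F_\alpha(S) + \sum_{q \in \hat{Q}} c(q)(\alpha - F_\alpha(S)) / \mathrm{GCC} \leq \alpha$$

which is a contradiction. □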
We can now show approximate optimality. 

Theorem 1. Assume that $\alpha$ is an integer and, for any $h \in H$, $F_h$ is an integral monotone non-decreasing submodular function. Algorithm 1 incurs at most $\mathrm{GCC}(1 + \ln(\alpha n))$ cost, where $n = |H|$.

Proof. Let $q_i$ be the question asked on the $i$th iteration, $S_i$ be the set of question-response pairs after asking $q_i$, and $C_i$ be $\sum_{j \leq i} c(q_j)$. By Lemma 6

$$F_\alpha(S_i) - F_\alpha(S_{i-1}) \geq c(q_i)(\alpha - F_\alpha(S_{i-1})) / \mathrm{GCC}$$

After some algebra we get

$$\alpha - F_\alpha(S_i) \leq (\alpha - F_\alpha(S_{i-1}))(1 - c(q_i)/\mathrm{GCC})$$

Now using $1 - x \leq e^{-x}$ and applying the bound recursively down to $\alpha - F_\alpha(S_0) \leq \alpha$

$$\alpha - F_\alpha(S_i) \leq (\alpha - F_\alpha(S_{i-1})) e^{-c(q_i)/\mathrm{GCC}} \leq \alpha e^{-C_i/\mathrm{GCC}}$$

We have shown that the gap $\alpha - F_\alpha(S_i)$ decreases exponentially fast with the cost of the questions asked. The remainder of the proof proceeds by showing that (1) we can decrease the gap to $1/|H|$ using questions with at most $\mathrm{GCC} \ln(\alpha |H|)$ cost and (2) we can decrease the gap from $1/|H|$ to $0$ with one question with cost at most GCC.

Let $j$ be the largest integer such that $\alpha - F_\alpha(S_j) \geq 1/|H|$ holds. Then

$$1/|H| \leq \alpha e^{-C_j/\mathrm{GCC}}$$

Solving for $C_j$ we get $C_j \leq \mathrm{GCC} \ln(\alpha |H|)$. This completes (1).

By Lemma 6, $F_\alpha(S_i) < F_\alpha(S_{i+1})$ (we strictly increase the objective on each iteration). Because $\alpha$ is an integer and for every $h$ $F_h$ is an integral function, we can conclude $F_\alpha(S_{i+1}) \geq F_\alpha(S_i) + 1/|H|$. Then $q_{j+1}$ will be the final question asked. By Lemma 6, $q_{j+1}$ can have cost no greater than GCC. This completes (2). We can finally conclude the cost incurred by the greedy algorithm is at most $\mathrm{GCC}(1 + \ln(\alpha |H|))$. □

By combining Theorem 1 and Lemma 5 we get

Corollary 1. For integer $\alpha$ and integral monotone non-decreasing submodular $F_h$, the worst case cost of Algorithm 1 is within $1 + \ln(\alpha |H|)$ of that of any other correct question asking strategy.
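The algebra used in the proof of Theorem 1 can be written out step by step. The derivation below is a sketch that assumes $S_0$ is the initial (empty) set of question-response pairs and that $F_\alpha(S_0) \geq 0$, so that the initial gap is at most $\alpha$.

```latex
\begin{align*}
F_\alpha(S_i) - F_\alpha(S_{i-1})
  &\geq \frac{c(q_i)}{\mathrm{GCC}}\,\bigl(\alpha - F_\alpha(S_{i-1})\bigr)
  && \text{(Lemma 6)} \\
\alpha - F_\alpha(S_i)
  &= \bigl(\alpha - F_\alpha(S_{i-1})\bigr) - \bigl(F_\alpha(S_i) - F_\alpha(S_{i-1})\bigr) \\
  &\leq \bigl(\alpha - F_\alpha(S_{i-1})\bigr)\Bigl(1 - \frac{c(q_i)}{\mathrm{GCC}}\Bigr) \\
  &\leq \bigl(\alpha - F_\alpha(S_{i-1})\bigr)\, e^{-c(q_i)/\mathrm{GCC}}
  && \text{(using } 1 - x \leq e^{-x}\text{)} \\
  &\leq \alpha \prod_{j \leq i} e^{-c(q_j)/\mathrm{GCC}}
   = \alpha\, e^{-C_i/\mathrm{GCC}}
  && \text{(unrolling the recursion)} \\
\frac{1}{|H|} \leq \alpha\, e^{-C_j/\mathrm{GCC}}
  &\iff e^{C_j/\mathrm{GCC}} \leq \alpha |H|
   \iff C_j \leq \mathrm{GCC}\,\ln(\alpha |H|)
\end{align*}
```

Adding the cost of the final question, which is at most GCC, gives the bound $\mathrm{GCC}(1 + \ln(\alpha |H|))$.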

We have shown a result for integer valued a and objective functions. We speculate that for more general non- 
integer objectives it should be possible to give results similar to those for standard submodular set cover [18]. These 
approximation bounds typically add an additional normalization term. 

6 Negative Results 

6.1 Naive Greedy 

The algorithm we propose is not the most obvious approach to the problem. A more direct extension of the standard submodular set cover algorithm is to choose at each time step a question $q_i$ which has not been asked before and that maximizes the worst case gain of $F_{h^*}$. In other words, choose the question $q_i$ that maximizes

$$\min_{h \in V(S)} \min_{r_i \in q_i(h)} \left(F_h(S + (q_i, r_i)) - F_h(S)\right) / c(q_i)$$

This is in contrast to the method we propose, which maximizes the worst-case gain of $F_\alpha$ instead of $F_h$. We call this strategy the Naive Greedy Algorithm. In general, this algorithm performs much worse than the optimal strategy. The counterexample is very similar to that given by Krause et al. [14] for the equivalent approach in the non-interactive setting.

Theorem 2. Assume $F_h$ is integral for all $h \in H$ and $\alpha$ is an integer. The Naive Greedy Algorithm has approximation ratio at least $\Omega(\alpha \max_i c(q_i) / \min_i c(q_i))$.

Proof. Consider the following example with $|H| = 2$, $|Q| = \alpha + 2$, $|R| = 1$ and $\alpha > 1$. When $|R| = 1$ responses reveal no information about $h^*$, so the interactive problem is equivalent to the non-interactive problem, and the objective function only depends on the set of questions asked. Let $F_{h_1}$ and $F_{h_2}$ be modular functions defined by

$$F_{h_1}(q_1) = \alpha \qquad F_{h_1}(q_2) = 0$$
$$F_{h_2}(q_1) = 0 \qquad F_{h_2}(q_2) = \alpha$$

and, for all $h$ and all $q_i$ with $i > 2$, $F_h(q_i) = 1$. The optimal strategy asks $q_1$ and $q_2$ (since $h^*$ is unknown we must ask both). However, the worst-case gain of asking $q_1$ or $q_2$ is zero while the gain of asking $q_i$ for $i > 2$ is $1/c(q_i)$. The Naive Greedy Algorithm will then always ask every $q_i$ for $i > 2$ before asking $q_1$ and $q_2$ no matter how large $c(q_i)$ is compared to $c(q_1)$ and $c(q_2)$. By making $c(q_i)$ for $i > 2$ large compared to $c(q_1)$ and $c(q_2)$ we get the claimed approximation ratio. □
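The counterexample in the proof can be simulated directly. The following Python sketch is an illustration under assumptions, not the authors' code: it fixes $\alpha = 3$, gives $q_1$ and $q_2$ cost 1 and the remaining questions cost 10, and uses the hypothetical function name `naive_greedy`. Because $|R| = 1$, responses are omitted and each modular $F_h$ is stored as a per-question value.

```python
def naive_greedy(questions, F, H, alpha, cost):
    """Repeatedly ask the unasked question with the best worst-case gain in F_h per unit cost."""
    asked = []
    # With |R| = 1 nothing is learned about h*, so every hypothesis must be covered.
    while any(sum(F[h][q] for q in asked) < alpha for h in H):
        remaining = [q for q in questions if q not in asked]
        q = max(remaining, key=lambda q: min(F[h][q] for h in H) / cost[q])
        asked.append(q)
    return asked

alpha = 3
extra = [f"q{i}" for i in range(3, alpha + 3)]  # q3, q4, q5: value 1 for every h
questions = ["q1", "q2"] + extra
H = ["h1", "h2"]
F = {
    "h1": {"q1": alpha, "q2": 0, **{q: 1 for q in extra}},
    "h2": {"q1": 0, "q2": alpha, **{q: 1 for q in extra}},
}
cost = {"q1": 1, "q2": 1, **{q: 10 for q in extra}}  # assumed costs

asked = naive_greedy(questions, F, H, alpha, cost)
naive_cost = sum(cost[q] for q in asked)
optimal_cost = cost["q1"] + cost["q2"]  # asking q1 and q2 covers both hypotheses
```

With these assumed costs the sketch asks `q3`, `q4`, `q5` for a total cost of 30, while asking `q1` and `q2` costs 2. The ratio of 15 equals $\alpha \max_i c(q_i) / (2 \min_i c(q_i))$ for this instance.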

6.2 Learn then Cover 

The method we propose for interactive submodular set cover simultaneously solves the learning problem and covering 
problem in parallel, only solving the learning problem to the extent that it helps solve the covering problem. A simpler 
strategy is to solve these two problems in series (i.e. first identify h* using the standard greedy query learning algorithm 
and second solve the submodular set cover problem for Fh* using the standard greedy set cover algorithm). We call 
this the Learn then Cover approach. We show that this approach and in fact any approach that identifies h* exactly can 
perform very poorly. Therefore it is important to consider the learning problem and covering problem simultaneously. 

Theorem 3. Assume $F_h$ is integral for all $h$ and that $\alpha$ is an integer. Any algorithm that exactly identifies $h^*$ has approximation ratio at least $\Omega(|H| \max_i c(q_i) / \min_i c(q_i))$.

Proof. We give a simple example for which the learning problem (identifying $h^*$) is hard but the interactive submodular set cover problem (satisfying $F_{h^*}(S) \geq \alpha$) is easy. For $i \in 1 \ldots |H|$ let $q_i(h_j) = \{1\}$ if $i = j$ and $q_i(h_j) = \{0\}$ if $i \neq j$. For $i = |H| + 1$ let $q_i(h_j) = \{0\}$ for all $j$. For the worst case choice of $h^*$, we need to ask every question $q_i$ for $i \in 1 \ldots |H|$ in order to identify $h^*$. However, if we define the objective to be

$$F_h(S) \triangleq I((q_{|H|+1}, 0) \in S)$$

for all $h$ with $\alpha = 1$, the interactive submodular set cover problem is easy. To satisfy $F_{h^*}(S) \geq \alpha$ we simply need to ask question $q_{|H|+1}$. By making the cost of $q_{|H|+1}$ small and the cost of the other questions large, we get an approximation ratio of at least $|H| \max_i c(q_i) / \min_i c(q_i)$. □



6.3 Adaptivity Gap 

Another simple approach is to ignore feedback and solve the covering problem for all $h \in H$. We call this the Cover
All method. This method is an example of a non-adaptive method: a non-adaptive (i.e. non-interactive) method is any
method that does not use responses to previous questions in deciding which question to ask next. The adaptivity gap
[6] for a problem characterizes how much worse the best non-adaptive method can perform as compared to the best 
adaptive method. For interactive submodular set cover we define the adaptivity gap to be the maximum ratio between 
the cost of the optimal non-adaptive strategy and the optimal adaptive strategy. With this definition, we can show that, 
in contrast to related problems [1] where the adaptivity gap is a constant, the adaptivity gap for interactive submodular 
set cover is quite large. 

Theorem 4. The adaptivity gap for interactive submodular set cover is at least $\Omega(|H| / \ln |H|)$.

Proof. The result follows directly from the connection to active learning (Section 3.2) and in particular any example of exact active learning giving an exponential speed up over passive learning. A classic example is learning a threshold on a line [4]. Let $|H| = 2^k$ for some integer $k > 0$. Define the active learning objective as before

$$F_h(S) \triangleq |H \setminus V(S)|$$

for all $h$. The goal of the problem is to identify $h^*$. We define the query set such that we can identify $h^*$ through binary search. Let there be a query $q_i$ corresponding to each hypothesis $h_i$. Let $q_i(h_j) = \{1\}$ if $i \leq j$ and $q_i(h_j) = \{0\}$ if $i > j$. Each $q_i$ can be thought of as a point on a line with $h_i$ the binary classifier which classifies all points as positive which are less than or equal to $q_i$. By asking question $q_{2^{k-1}}$ we can eliminate half of $H$ from the version space. We can then recurse on the remaining half of $H$ and identify $h^*$ in $k$ queries. Any non-adaptive strategy on the other hand must perform all $2^k$ queries in order to ensure $|V(S)| = 1$ for the worst case choice of $h^*$. □
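The adaptive side of this argument is ordinary binary search over the hypothesis index. The sketch below is a minimal illustration, assuming hypotheses are indexed $1 \ldots 2^k$ and that a callable `answer(i)` returns 1 when $i \leq j$ for the target index $j$ and 0 otherwise. The function name `identify_threshold` and the choice of midpoint are hypothetical and differ slightly from the first query named in the proof.

```python
def identify_threshold(n, answer):
    """Identify the target index j in 1..n, where answer(i) == 1 iff i <= j.

    Returns (j, number_of_queries).
    """
    lo, hi, queries = 1, n, 0
    while lo < hi:
        mid = (lo + hi + 1) // 2  # splits the remaining version space in two
        queries += 1
        if answer(mid) == 1:      # target index j satisfies mid <= j
            lo = mid
        else:                     # target index j satisfies j < mid
            hi = mid - 1
    return lo, queries

k = 4
n = 2 ** k
results = [identify_threshold(n, lambda i, j=j: int(i <= j)) for j in range(1, n + 1)]
```

For $|H| = 2^k$ this search uses exactly $k$ adaptive queries for every target, whereas a strategy that fixes its queries in advance needs a number of queries linear in $|H|$ to separate adjacent hypotheses.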

This result shows that, even if we optimally solve the submodular set cover problem, the Cover All method can incur
exponentially greater cost than the optimal adaptive strategy.



6.4 Hardness of Approximation 

We show that the $1 + \ln(\alpha |H|)$ approximation factor achieved by the method we propose is in fact the best possible up to the constant factor, assuming there are no slightly superpolynomial time algorithms for NP. The result and proof are very similar to those for the non-interactive setting [14].

Theorem 5. Interactive submodular set cover cannot be approximated within a factor of $(1 - \epsilon) \max(\ln |H|, \ln \alpha)$ in polynomial time for any $\epsilon > 0$ unless NP has $n^{O(\log \log n)}$ time deterministic algorithms.

Proof. We show the result by reducing set cover to interactive submodular set cover in two different ways. In the first reduction, a set cover instance of size $n$ gives an interactive submodular set cover instance with $|H| = 1$ and $\alpha = n$. In the second reduction, a set cover instance of size $n$ gives an interactive submodular set cover instance with $|H| = n$ and $\alpha = 1$. The theorem then follows from the result of Feige [7], which shows that set cover cannot be approximated within a factor of $(1 - \epsilon) \ln n$ in polynomial time for any $\epsilon > 0$ unless NP has $n^{O(\log \log n)}$ time deterministic algorithms.

Let $V$ be the set of sets defining the set cover problem. The ground set is $\bigcup_{v \in V} v$. The goal of set cover is to find a small set of sets $S \subseteq V$ such that $\bigcup_{s \in S} s = \bigcup_{v \in V} v$. For both reductions we use $|R| = 1$ (all questions have only one response) and make each question in $Q$ correspond to a set in $V$. For a set of question-response pairs $\hat{S}$ define $V_{\hat{S}}$ to be the subset of $V$ corresponding to the questions in $\hat{S}$. For the first reduction with $|H| = 1$, we set the one objective function $F_h(\hat{S}) = |\bigcup_{v \in V_{\hat{S}}} v|$. With $\alpha = n$, we have that $F_\alpha(\hat{S}) = \alpha$ iff $V_{\hat{S}}$ forms a cover.

For the second reduction with $|H| = n$, define $F_{h_i}(\hat{S})$ for the $i$th hypothesis $h_i$ to be 1 iff the $i$th object in the ground set of the set cover problem is covered by $V_{\hat{S}}$. More formally $F_{h_i}(\hat{S}) = I(v_i \in \bigcup_{v \in V_{\hat{S}}} v)$ where $v_i$ is the $i$th item in the ground set (ordered arbitrarily). This is similar to the first reduction except we have broken down the objective into a sum over the ground set elements. With $\alpha = 1$, we then have that $F_\alpha(\hat{S}) = \alpha$ iff $V_{\hat{S}}$ forms a cover. □

Data Set / Hypothesis Class        Simultaneous Learning and Covering    Learn then Cover    Cover All
---------------------------------  ------------------------------------  ------------------  -----------
Enron / Clusters                   130.04                                101.81              [illegible]
Physics / Clusters                 [illegible]                           177.88              [illegible]
Physics Theory / Clusters          [illegible]                           [illegible]         [illegible]
Epinions / Clusters                774.81                                779.23              15777.00
Slashdot / Clusters                709.30                                715.39              15383.00
Enron / Noisy Clusters             179.00                                231.03              3091.00
Physics / Noisy Clusters           186.13                                225.02              3340.00
Physics Theory / Noisy Clusters    160.62                                201.24              3170.00
Epinions / Noisy Clusters          788.52                                788.06              15777.00
Slashdot / Noisy Clusters          804.87                                804.86              15383.00

Table 1: Average number of queries required to find a dominating set in the target group.
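The two reductions in the proof of Theorem 5 can be checked on a small set cover instance. The sketch below is illustrative only: since $|R| = 1$ the version space never shrinks, so it assumes $F_\alpha(\hat{S}) = \frac{1}{|H|} \sum_{h \in H} \min(\alpha, F_h(\hat{S}))$, and the function names and the example instance are hypothetical.

```python
from fractions import Fraction

def covered(sets, chosen):
    """Union of the chosen sets (the elements covered by V_S)."""
    return set().union(*(sets[name] for name in chosen))

def f_alpha_first(sets, ground, chosen):
    """First reduction: |H| = 1, alpha = n, F_h(S) = size of the union."""
    return min(len(ground), len(covered(sets, chosen)))

def f_alpha_second(sets, ground, chosen):
    """Second reduction: |H| = n, alpha = 1, F_{h_i}(S) = 1 iff element i is covered."""
    hit = covered(sets, chosen)
    return Fraction(sum(min(1, int(v in hit)) for v in ground), len(ground))

sets = {"A": {1, 2}, "B": {2, 3}, "C": {3}}
ground = {1, 2, 3}

full = ["A", "B"]   # a cover of the ground set
partial = ["A"]     # leaves element 3 uncovered
first_full, second_full = f_alpha_first(sets, ground, full), f_alpha_second(sets, ground, full)
first_partial, second_partial = f_alpha_first(sets, ground, partial), f_alpha_second(sets, ground, partial)
```

In both reductions the objective reaches its target ($n = 3$ in the first, $1$ in the second) exactly when the chosen sets form a cover, and falls short otherwise.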

The approximation factor we have shown for the greedy algorithm is

$$1 + \ln(\alpha |H|) = 1 + \ln \alpha + \ln |H| \leq 1 + 2 \max(\ln |H|, \ln \alpha)$$

so our hardness of approximation result matches up to the constant factor and lower order term.

7 Experiments 

We tested our method on the interactive dominating set problem described in Section 4. In this problem, we are 
given a graph and H is a set of possibly overlapping clusters of nodes. The goal is to find a small set of nodes 
which forms a dominating set of an initially unknown target group h* ∈ H. After selecting each node, we receive
feedback indicating if the selected node is in the target group. Our proposed method (Simultaneous Learning and 
Covering) simultaneously learns about the target group h* and finds a dominating set for it. We compare to two 
baselines: a method which first exactly identifies h* and then finds a dominating set for the target group (Learn 
then Cover) and a method which simply ignores feedback and finds a dominating set for the union of all clusters 
(Cover All). Note that Theorem 3 and Theorem 4 apply to Learn then Cover and Cover All respectively, so these 
methods do not have strong theoretical guarantees. However, we might hope that for reasonable real world
problems they perform well. We use real world network data sets with simple synthetic hypothesis classes designed
to illustrate differences between the methods. The networks are from Jure Leskovec's collection of datasets available
at http://snap.stanford.edu/data/index.html. We convert all the graphs into undirected graphs and
remove self edges.

Table 1 shows our results. Each reported result is the average number of queries over 100 trials. Bolded results 
are the best methods for each setting with multiple results bolded when differences are not statistically significant 
(within p = .01 with a paired t-test). In the first set of results (Clusters), we create H by using the METIS graph 
partition package 4 separate times partitioning the graph into 10, 20, 30, and 40 clusters. H is the combined set of 
100 clusters, and these clusters overlap since they are taken from 4 separate partitions of the graph. The target h* 
is chosen at random from H. With this hypothesis class, we've found that there is very little difference between the 
Simultaneous Learning and Covering and the Learn then Cover methods. The Cover All method performs significantly 
worse because without the benefit of feedback it must find a dominating set of the entire graph. 

In the second set of results, we use a hypothesis class designed to make learning difficult (Noisy Clusters). We start with H generated as before. We then add to H 100 additional hypotheses which are each very similar to h*. Each of these hypotheses consists of the target group h* with a random member removed. H is then the combined set of the 100 original hypotheses and these 100 variations of h*. For this hypothesis class, Learn then Cover performs significantly worse than our Simultaneous Learning and Covering method on 3 of the 5 data sets. Learn then Cover exactly identifies h*, which is difficult because of the many hypotheses similar to h*. Our method learns about h* but only to the extent that it is helpful for finding a small dominating set. On the other two data sets Learn then Cover and Simultaneous Learning and Covering are almost identical. These are larger data sets, and we've found that when the covering problem requires many more queries than the learning problem, our method is nearly identical to Learn then Cover. This makes sense since when $\alpha$ is large compared to the sum over $F_h(S)$ the second term in $F_\alpha$ dominates.
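The Noisy Clusters construction described above (variations of h* with one random member removed) can be sketched as follows. This is a hypothetical helper, not the code used for the experiments: the function name `noisy_variants`, the seeded random generator, and the representation of a group as a set of node ids are all assumptions.

```python
import random

def noisy_variants(h_star, count=100, seed=0):
    """Return `count` copies of the target group, each with one random member removed."""
    rng = random.Random(seed)   # seeded only to make the sketch reproducible
    members = sorted(h_star)
    return [frozenset(h_star) - {rng.choice(members)} for _ in range(count)]

h_star = frozenset(range(10))   # toy target group of 10 node ids
variants = noisy_variants(h_star, count=100)
```

Each variant differs from the target group in a single node, so a learner must query that specific node to distinguish the two, which is what makes exact identification expensive for this hypothesis class.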

It is also possible to design hypothesis classes for which Cover All outperforms Learn then Cover: we found this 
is the case when the learning problem is difficult but the subgraph corresponding to the union of all clusters in H 
is small. In the appendix we give an example of this. In all cases, however, our approach performs about as well as or
better than the best of these two baseline methods. Although we use real world graph data, the hypothesis classes and
target hypotheses we use are very simple and synthetic, and as such these experiments are primarily meant to provide 
reasonable examples in support of our theoretical results. 

8 Future Work 

We believe there are other interesting applications which can be posed as interactive submodular set cover. In some 
applications it may be difficult to compute F a exactly because H may be very large or even infinite. In these cases, 
it may be possible to approximate this function by sampling from H . It's also important to consider methods that can 
handle misspecified hypothesis classes and noise within the learning. One approach could be to extend agnostic active
learning [2] results to a similar interactive optimization setting.

Data Set / Hypothesis Class          Simultaneous Learning and Covering    Learn then Cover    Cover All
-----------------------------------  ------------------------------------  ------------------  -----------
Enron / Balls                        [illegible]                           [illegible]         [illegible]
Physics / Balls                      [illegible]                           [illegible]         [illegible]
Physics Theory / Balls               [illegible]                           [illegible]         [illegible]
Epinions / Balls                     19.53                                 18.37               829.69
Slashdot / Balls                     [illegible]                           [illegible]         [illegible]
Enron / Noisy Balls                  [illegible]                           [illegible]         [illegible]
Physics / Noisy Balls                [illegible]                           [illegible]         [illegible]
Physics Theory / Noisy Balls         [illegible]                           [illegible]         [illegible]
Epinions / Noisy Balls               15.03                                 32.76               18.11
Slashdot / Noisy Balls               12.81                                 32.53               31.38
Enron / Expanded Clusters            84.90                                 84.23               3091.00
Physics / Expanded Clusters          150.28                                152.21              3340.00
Physics Theory / Expanded Clusters   120.84                                122.12              3170.00
Epinions / Expanded Clusters         260.21                                261.01              15777.00
Slashdot / Expanded Clusters         324.15                                325.35              15383.00

Table 2: Average number of queries required to find a dominating set in the target community.

A Additional Experiments

Table 2 shows additional experimental results using different hypothesis classes. In the first set of results, we use a 
hypothesis class H consisting of 100 randomly chosen geodesic balls of radius 2 (Balls). Each group h ∈ H is formed
by choosing a node uniformly at random from the graph and then finding all nodes within a shortest path distance of 
2. The target group h* is then selected at random from H. With this hypothesis class, we've found that there is very 
little difference between the Simultaneous Learning and Covering and the Learn then Cover methods, similar to the 
Clusters hypothesis class in Table 1. Learn then Cover is better on 3 of the 5 data sets, but the difference is very small
(around 1 query). The Cover All method again performs significantly worse because it must find a dominating set of 
all 100 of the geodesic balls. 
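The Balls hypothesis class described above can be generated with a bounded breadth-first search. The sketch below is an illustration under assumptions rather than the experimental code: the graph is assumed to be an undirected adjacency dictionary, and the names `geodesic_ball` and `sample_balls` and the seeded random generator are hypothetical.

```python
import random
from collections import deque

def geodesic_ball(adj, center, radius=2):
    """All nodes within shortest-path distance `radius` of `center` (BFS)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the ball's boundary
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return frozenset(dist)

def sample_balls(adj, count=100, radius=2, seed=0):
    """Hypothesis class: balls around centers chosen uniformly at random."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    return [geodesic_ball(adj, rng.choice(nodes), radius) for _ in range(count)]

# Toy path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ball_at_end = geodesic_ball(path, 0)
ball_at_middle = geodesic_ball(path, 2)
balls = sample_balls(path, count=100)
```

On the toy path graph the ball around node 0 contains nodes 0, 1 and 2, while the ball around the middle node covers the whole path.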

In the second set of results, Noisy Balls, we use a hypothesis class similar to the Noisy Clusters hypothesis class in 
Table 1 but using random geodesic balls. We first generate 2 core groups by sampling random geodesic balls of radius 
2 as before. We then generate 50 small variations of each of these 2 core groups, each consisting of the core group with 
a random member removed. H is this set of 100 variations, and the target group h* is again selected at random from 
H. For this hypothesis class, Simultaneous Learning and Covering outperforms the other methods because it learns 
about h* but only to the extent that it is helpful for finding a small dominating set. Cover All actually outperforms 
Learn then Cover with this hypothesis class, because the total number of vertices in the union of all clusters in H is 
small. 

In the third set of results denoted Expanded Clusters, we create H by partitioning the graph into 100 clusters using 
the METIS [11] graph partitioning package and then expand each of these clusters to include its immediate neighbors. 
This creates a set of 100 overlapping clusters with shared vertices on the fringes of each cluster. As before the target 
hypothesis is selected at random from H. We have found that results with this hypothesis class are similar to those
with the Balls and Clusters hypothesis classes.
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